Move coroutine upvars into locals for better memory economy #135527

dingxiangfei2009 · 2025-01-15T10:28:29Z

Replaces #127522
Related to #62958

The problem statement

#62958 demonstrates two problems. First, upvars are unconditionally promoted to prefix data fields of the state machine. Second, captured upvars are not subjected to liveness analysis, so the opportunity for a more compact data layout is lost: the memory occupied by upvars is never reclaimed and made available for other data saved across yield points, even when the upvars are dead at those suspension points.

The second problem is better demonstrated with this code snippet.

async fn work(another_fut: impl Future) {
    let _ = another_fut.await;
    // now `another_fut` is consumed
    let next_fut = async { /* .. */ };
    next_fut.await;
}

// `work`'s layout needs to reserve space for both `another_fut` and `next_fut`, while there is a clear missed opportunity
// to overlap the memory for `another_fut` and `next_fut` for better memory economy.
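
One way to observe this is to compare future sizes with `std::mem::size_of_val`. The following is a minimal sketch, not part of the patch: `hold_buffer` and `sizes` are hypothetical helpers introduced for illustration, and the exact numbers depend on the compiler version and the layout scheme in effect.

```rust
use std::future::Future;
use std::mem::size_of_val;

/// Hypothetical helper: a future that keeps an `N`-byte buffer alive across an await.
async fn hold_buffer<const N: usize>() -> usize {
    let buf = [1u8; N];
    // `buf` is used after this await, so it has to be saved in the future's state.
    std::future::ready(()).await;
    buf.iter().map(|b| *b as usize).sum()
}

async fn work(another_fut: impl Future) {
    let _ = another_fut.await;
    // now `another_fut` is consumed
    let next_fut = hold_buffer::<1024>();
    next_fut.await;
}

/// Returns (size of the captured future, size of the future returned by `work`).
fn sizes() -> (usize, usize) {
    let captured = hold_buffer::<1024>();
    let captured_size = size_of_val(&captured);
    let outer = work(captured);
    (captured_size, size_of_val(&outer))
}
```

If the two inner futures are not overlapped, the second number should come out at roughly twice the first; with overlap it could approach the size of the larger inner future.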

The difficulty lies with the fact that captured upvars do not receive their own locals inside a coroutine body. If we can assign locals to them somehow, we can run the layout scheme as usual and the optimisation on the data layout comes into effect out of the box in most cases.

Proposed changes

This is initial work to improve the memory economy of coroutines and async futures by reducing the unnecessary promotion of captured upvars into the state prefix.

The patch contains the following changes.

Introduction of a RelocateUpvar MIR pass that inserts a MIR gadget, through which values captured by coroutine or async bodies or closures are moved into inner MIR locals. This opens the opportunity to subject the captured upvars to the same liveness analysis and determine which ones need to be stored in the coroutine state during suspension.
With this gadget, we do not have to keep all upvars in the so-called prefix data regions of coroutine states. Instead, they are moved into the Unresumed state, which by convention is the first variant of the state ADT.
In addition, in case some upvars are eventually used across more than one suspension point, which leads to their promotion into the prefix after all, we further arrange the coroutine state data layout so that their offsets in the Unresumed state coincide with their memory slots after promotion. This means that during codegen, the additional moves introduced by the RelocateUpvar gadget are actually elided. The relevant change is implemented in rustc_abi.
We then have to translate direct field accesses to the upvars into accesses behind the Unresumed variant.
We assert an invariance relation between the types of the captured upvars and the types of the respective relocated locals.
We have to update diagnostics so that they are better informed about captured values and make more sense in view of this change.
As requested by the review comments, the relocation only applies behind an unstable compiler flag -Z pack-coroutine-layout=captures-only. The default is pack-coroutine-layout=no, so that we keep the layout aligned with stable.

Other than upvars, the coroutine state data layout scheme remains largely the same.
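
To make the intended effect concrete, here is a hand-written model of the two layouts. It is only a sketch under assumptions: the type names `PrefixLayout`, `PrefixState` and `RelocatedLayout` are hypothetical, the 64-byte arrays stand in for a captured upvar and a saved local, and real coroutine layouts are computed by the compiler rather than written as ordinary enums.

```rust
use std::mem::size_of;

/// Model of the existing scheme: the upvar always occupies a prefix field,
/// whatever state the coroutine is in.
#[allow(dead_code)]
struct PrefixLayout {
    upvar: [u8; 64],
    state: PrefixState,
}

#[allow(dead_code)]
enum PrefixState {
    Unresumed,
    Suspend0 { saved: [u8; 64] },
    Returned,
}

/// Model of the relocated scheme: the upvar lives only in `Unresumed`,
/// so its space can be shared with data saved in later states.
#[allow(dead_code)]
enum RelocatedLayout {
    Unresumed { upvar: [u8; 64] },
    Suspend0 { saved: [u8; 64] },
    Returned,
}

/// Returns (size of the prefix model, size of the relocated model).
fn layout_sizes() -> (usize, usize) {
    (size_of::<PrefixLayout>(), size_of::<RelocatedLayout>())
}
```

In this model the prefix variant has to hold the upvar and the saved local side by side, while the relocated variant only needs room for the larger of the two plus a discriminant.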

Design decisions

Why does this patch not perform relocation as part of the `StateTransform` pass?

This idea was already explored in #120168 back in 2023. The conclusion then was that it does not interact well with MIR dataflow analysis. It requires the StateTransform pass to assign a virtual "MIR local" to each upvar at the beginning. This made the change difficult to review, because it overloads the already huge StateTransform pass with additional renumbering work. The idea has always been that it is better to perform the renumbering in its own pass, to keep StateTransform simple.

This patch has gone further to carry out the re-write as early as possible, so that the passes in between can perform rewrites as per current MIR local semantics and optimisation rules.

Further optimisation to be implemented behind a feature gate

Point 4 mentions that any local to be saved across suspensions will be promoted whenever it is live across two or more yield locations. We would like to run an experiment behind a feature gate on improvements to the layout scheme. For ease of reviewing, it is better to drop this part of the work from this PR. Nevertheless, the idea runs along the lines of the implementation in #127522 and we intend to propose a second PR just for that.
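
The promotion rule described above can be sketched as a small classification function. This is a simplified illustration, not the compiler's implementation: `Slot` and `classify` are hypothetical names, and the input is assumed to be the list of distinct yield points across which a local is live.

```rust
/// Where a local ends up in this simplified model of the layout scheme.
#[derive(Debug, PartialEq, Eq)]
enum Slot {
    /// Not live across any yield point, so it is not stored in the state.
    NotSaved,
    /// Live across exactly one yield point: stored in that variant and
    /// free to share memory with data of other variants.
    Variant(usize),
    /// Live across two or more yield points: promoted to the prefix.
    Prefix,
}

/// `live_at_yields` is assumed to hold distinct yield-point indices.
fn classify(live_at_yields: &[usize]) -> Slot {
    match live_at_yields {
        [] => Slot::NotSaved,
        [only] => Slot::Variant(*only),
        _ => Slot::Prefix,
    }
}
```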

Old PR description

Good day, this PR is related to #127522 and makes it easier for the public to test out a new coroutine/`async` state machine directly.

Prepare the compiler for tests

For starters, you may build the compiler as described in the rustc-dev-guide instructions. If testing in a Docker container is desirable, you may build this compiler with src/ci/docker/run.sh dist-x86_64-linux --dev for x86_64 and package the compiler with ../x dist to produce the artifacts in obj/dist-x86_64-linux/build/dist. This Dockerfile gets you a working Rust builder image which allows you to build your Rust applications in bookworm.

The state of performance

So far with this patch, I have been studying the performance impact on tokio's single- and multi-threaded runtimes, as well as on a simple axum HTTP service. As far as I can see, I cannot find a change in performance characteristics that is statistically significant at one-sided p = 0.05.

This time, I would like to ask for your assessments and thoughts on this patch. I kindly request experiments from you, and hopefully you can provide regression cases with perf record -e cycles:u,instructions:u,cache-misses:u reports.

Thank you all so much! 🙇

rustbot · 2025-01-15T10:28:39Z

r? @BoxyUwU

rustbot has assigned @BoxyUwU.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

rustbot · 2025-01-15T10:28:41Z

Some changes occurred in compiler/rustc_codegen_cranelift

cc @bjorn3

Some changes occurred to the CTFE / Miri interpreter

cc @rust-lang/miri

Some changes occurred to MIR optimizations

cc @rust-lang/wg-mir-opt

Some changes occurred to the CTFE machinery

cc @rust-lang/wg-const-eval

bors · 2025-01-19T07:02:24Z

☔ The latest upstream changes (presumably #135715) made this pull request unmergeable. Please resolve the merge conflicts.

BoxyUwU · 2025-01-28T13:26:53Z

I don't think this needs a reviewer?

traviscross · 2025-01-28T23:26:23Z

cc @Darksonn @tmandry @eholk @rust-lang/wg-async

Ding here is reworking the layout of coroutines to try to reduce their memory footprint (and that of Futures). He's curious to find whether this introduces any performance or other regressions. In his own testing, he's not been able to find any, but he's interested in more data and experience here to help inform whether this is a worthwhile change.

What do people think?

tmandry · 2025-01-30T00:47:18Z

For anyone searching for a description of what this PR changes, it's summarized at the top of compiler/rustc_mir_transform/src/coroutine/relocate_upvars.rs.

tmandry · 2025-01-30T00:55:41Z

compiler/rustc_mir_transform/src/coroutine/relocate_upvars.rs

+//! The reason is that it is possible that coroutine layout may change and the source memory location of
+//! an upvar may not necessarily be mapped exactly to the same place as in the `Unresumed` state.


Don't we decide the offsets of upvars in Unresumed in the same place as we decide the offset of saved locals? Couldn't we then "backpropagate" the field offsets for each upvar's local as the offset for the corresponding upvar?

Thank you for reviewing! I had a backlog of things due to sickness.

True indeed. This statement is completely voided by the work in the second commit. I will reword this section in the following way.

By enabling the feature gate coroutine_new_layout, the field offsets of the upvars in the Unresumed state are placed exactly where their corresponding saved locals are, which is guaranteed by the alternative coroutine layout calculator that takes effect. <... quote the relevant comment/file/etc. ...>

tmandry · 2025-01-30T00:57:38Z

I don't personally have any means of performance testing this at the moment. It would be much easier if it landed behind a feature gate.

bors · 2025-01-31T07:38:08Z

☔ The latest upstream changes (presumably #135318) made this pull request unmergeable. Please resolve the merge conflicts.

nikomatsakis · 2025-02-01T01:04:11Z

Cc @arielb1 who was also investigating this

dingxiangfei2009 · 2025-02-09T19:20:02Z

@tmandry

I don't personally have any means of performance testing this at the moment. It would be much easier if it landed behind a feature gate.

I think it is fair to land with a feature gate so that we can get to play with it. The PR has temporarily disabled the check on the feature gate. However, given that coroutine layout data is keyed individually by their DefId, I think it is still safe to allow code to link to each other even when the feature gate status varies among the crates.

eholk · 2025-02-12T21:15:58Z

I don't personally have any means of performance testing this at the moment. It would be much easier if it landed behind a feature gate.

Would this be better as a #[feature(...)] gate, or as -Z new_coroutine_layout? I think the compiler flag feels like a better fit for something like this.

oli-obk · 2025-02-13T09:03:57Z

Are there any issues if only one crate activates it but others do not? if there are no issues, a feature gate seems ok (and easier to use ^^)

bors · 2025-02-14T20:49:54Z

☔ The latest upstream changes (presumably #137030) made this pull request unmergeable. Please resolve the merge conflicts.

Dirbaio · 2025-02-15T11:59:44Z

A feature doesn't allow turning it on for the whole build, you'd have to fork every single crate that uses async. A -Z flag would be better IMO.

tmandry · 2025-02-18T19:27:12Z

Agreed on a -Z flag being better for testing for the reason @Dirbaio gave.

If my understanding is correct, we shouldn't expect any regression from this approach (only upside), but since we currently rely on later passes eliding copies there might be some regression. We could be more aggressive in eliding the copies ourselves, but maybe this is hard.

dingxiangfei2009 · 2025-02-19T11:04:04Z

Thanks for looking into this!

I will have time this week to clean this up a bit and I will ask rustbot to set it to ready-for-review.

rustbot · 2025-03-09T22:29:02Z

Some changes occurred in compiler/rustc_codegen_ssa

cc @WaffleLapkin

RalfJung · 2025-07-15T06:13:15Z

compiler/rustc_middle/src/mir/mod.rs

@@ -375,6 +375,9 @@ pub struct Body<'tcx> {
    #[type_foldable(identity)]
    #[type_visitable(ignore)]
    pub function_coverage_info: Option<Box<coverage::FunctionCoverageInfo>>,
+
+    /// Coroutine local-upvar map
+    pub local_upvar_map: IndexVec<FieldIdx, Option<Local>>,


Can you say a bit more about what this map does, and in particular how it affects the operational semantics of MIR? You didn't adjust the interpreter to use this field, so -- is it correct for a backend to entirely ignore this field?

I have added the comment. Basically, this map is for diagnostics and for asserting region invariance between the relocated locals and the upvars.

RalfJung · 2025-07-15T06:16:08Z

compiler/rustc_mir_transform/src/coroutine/relocate_upvars.rs

+//! the base. For instance, `(_1.4 as Some).0` is rewritten into `(_34 as Some).0` when `_34` is the fresh local
+//! corresponding to the captured upvar stored in `_1.4`.
+//!
+//! 3. It assembles an prologue to replace the current entry block.


Suggested change

//! 3. It assembles an prologue to replace the current entry block.

//! 3. It assembles a prologue to replace the current entry block.

Applied in rebase

RalfJung · 2025-07-15T06:17:49Z

compiler/rustc_mir_transform/src/coroutine/relocate_upvars.rs

+//!
+//! This prologue block transfers every captured upvar into its corresponding fresh local, *via scratch locals*.
+//! The upvars are first completely moved into the scratch locals in batch, and then moved into the destination
+//! locals in batch.


I do not understand what these "scratch" locals are doing or why they are needed. Could you add an example?

Example added.

The problem lies with a possible permutation of captures when a coroutine transitions from the unresumed state to another state.

Now that I have given it some thought, I would like to discuss a more general fix with the team. I will raise it in the Zulip thread mentioned in the review comment.
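
As a rough illustration of why a permutation calls for scratch locals, consider this toy model. It is a sketch under assumptions and not the pass itself: `slots` stands in for memory slots, each pair in `moves` is a hypothetical (source, destination) relocation, and the function names are made up for this example.

```rust
/// Moves values one by one. If a destination is also a later source,
/// the later move reads a value that has already been overwritten.
fn relocate_direct(slots: &mut [u32], moves: &[(usize, usize)]) {
    for &(src, dst) in moves {
        slots[dst] = slots[src];
    }
}

/// Reads every source into scratch storage first, then writes every
/// destination, so no source is clobbered before it has been read.
fn relocate_via_scratch(slots: &mut [u32], moves: &[(usize, usize)]) {
    let scratch: Vec<u32> = moves.iter().map(|&(src, _)| slots[src]).collect();
    for (&(_, dst), value) in moves.iter().zip(scratch) {
        slots[dst] = value;
    }
}

/// Applies a swap of slots 0 and 1 with both strategies.
fn swap_demo() -> ([u32; 2], [u32; 2]) {
    let swap = [(0, 1), (1, 0)];
    let mut direct = [10, 20];
    relocate_direct(&mut direct, &swap);
    let mut batched = [10, 20];
    relocate_via_scratch(&mut batched, &swap);
    (direct, batched)
}
```

With the swap above, the direct strategy loses the value 20, while the two-phase strategy yields the intended permutation.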

RalfJung · 2025-07-15T06:21:40Z

compiler/rustc_mir_transform/src/coroutine/relocate_upvars.rs

+//!
+//! Each place that starts with access into the coroutine structure `_1` is replaced with the fresh local as
+//! the base. For instance, `(_1.4 as Some).0` is rewritten into `(_34 as Some).0` when `_34` is the fresh local
+//! corresponding to the captured upvar stored in `_1.4`.


How do we ensure that mutations of the local representing the upvar are properly applied, like for an FnMut? Right now the text sounds like we'd mutate a copy instead, which would give the wrong behavior.

I hope that our discussion in #140132 has arrived at the consensus. Otherwise, I am opening a Zulip thread for discussion.

I think the discussion there didn't really reach a consensus, did it? We figured out what the actual proposal is, but there hasn't been consensus for actually doing this. E.g. @compiler-errors sounded far from convinced.

I'm afraid I don't have the bandwidth to follow this proposal so I will have to bow out here, but I think it may need involvement of @rust-lang/types or even t-lang.

Here I initiated a thread: #t-lang > Should we lift captures of coroutines into MIR arguments?

RalfJung · 2025-07-15T06:23:08Z

compiler/rustc_mir_transform/src/coroutine/relocate_upvars.rs

+//! 2. It replaces the places pointing into those upvars with places pointing into those locals instead
+//!
+//! Each place that starts with access into the coroutine structure `_1` is replaced with the fresh local as
+//! the base. For instance, `(_1.4 as Some).0` is rewritten into `(_34 as Some).0` when `_34` is the fresh local


What is the role of the as Some and .0 in this example? I am a bit confused.

I think it'd be good to extend the example and show the entire transform, including the "prologue", on a super simple MIR body.

Alongside a short description, I also attached a new mir-opt test to show what the prologue would look like.

bors · 2025-07-17T23:58:32Z

☔ The latest upstream changes (presumably #141762) made this pull request unmergeable. Please resolve the merge conflicts.

Dylan-DPC · 2025-08-14T17:07:23Z

@dingxiangfei2009 any updates on answering the review and resolving the conflicts? thanks

dingxiangfei2009 · 2025-08-15T14:45:03Z

@Dylan-DPC Yes, I just finished rebasing today. There was a new test that I have to fix. I will push it at a later hour.

dingxiangfei2009 · 2025-08-15T14:46:10Z

I will also push some updated documentation. I also feel that some items should be discussed in Zulip. Stay tuned.

cjgillot

I'm starting to review this, but I have trouble understanding one of the key design points.

Why is RelocateUpvars separate from StateTransform? Could we start by a change that is localized in StateTransform and then in another PR change analysis MIR?

About the changes to StateTransform itself, is there a way to make the transform use the existing infra? For instance, adding a yield terminator at the end of bb0 which will correspond to the unresumed state?

bors · 2025-08-21T07:22:03Z

☔ The latest upstream changes (presumably #145244) made this pull request unmergeable. Please resolve the merge conflicts.

bors · 2025-08-23T08:33:49Z

☔ The latest upstream changes (presumably #145773) made this pull request unmergeable. Please resolve the merge conflicts.

... and treat coroutine upvar captures as saved locals as well. This allows the liveness analysis to determine which captures are truly saved across a yield point and which are initially used but discarded at the first yield points.

In the event that upvar captures are promoted, most certainly because a coroutine suspends at least once, the slots in the promotion prefix shall be reused. This means that the copies emitted in the upvar relocation MIR pass will eventually be elided and eliminated in the codegen phase, hence no additional runtime cost is realised.

Additional MIR dumps are inserted so that it is easier to inspect the bodies of async closures, including those that capture the state by-value.

Debug information is updated to point at the correct location for upvars in borrow checking errors and final debuginfo.

A language change that this patch enables is now actually reverted, so that lifetimes on relocated upvars are invariant with the upvars outside of the coroutine body. We are deferring the language change to a later discussion.

Co-authored-by: Dario Nieuwenhuis <[email protected]>

Signed-off-by: Xiangfei Ding <[email protected]>

rustbot · 2025-08-25T15:31:40Z

This PR was rebased onto a different master commit. Here's a range-diff highlighting what actually changed.

Rebasing is a normal part of keeping PRs up to date, so no action is needed—this note is just to help reviewers.

rust-log-analyzer · 2025-08-25T15:39:00Z

The job tidy failed! Check out the build log: (web) (plain enhanced) (plain)

Click to see the possible cause of the failure (guessed by this bot)

spellcheck files
building external tool typos from package [email protected]
finished building tool typos
npm WARN deprecated [email protected]: This version is no longer supported. Please see https://eslint.org/version-support for other options.
npm ERR! code E403
npm ERR! 403 403 Forbidden - GET https://registry.npmjs.org/zod/-/zod-3.23.8.tgz
npm ERR! 403 In most cases, you or one of your dependencies are requesting
npm ERR! 403 a package version that is forbidden by your security policy, or
npm ERR! 403 on a server you do not have access to.

npm ERR! A complete log of this run can be found in: /home/user/.npm/_logs/2025-08-25T15_38_28_737Z-debug-0.log
tidy error: IO error: npm install returned exit code exit status: 1
npm install did not exit successfully
some tidy checks failed
Command `/checkout/obj/build/x86_64-unknown-linux-gnu/stage1-tools-bin/rust-tidy /checkout /checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo /checkout/obj/build 4 /node/bin/npm --extra-checks=py,cpp,js,spellcheck` failed with exit code 1
Created at: src/bootstrap/src/core/build_steps/tool.rs:1583:23
Executed at: src/bootstrap/src/core/build_steps/test.rs:1225:29

Command has failed. Rerun with -v to see more details.
Bootstrap failed while executing `test --stage 0 src/tools/tidy tidyselftest --extra-checks=py,cpp,js,spellcheck`
Build completed unsuccessfully in 0:02:58
  local time: Mon Aug 25 15:38:50 UTC 2025
  network time: Mon, 25 Aug 2025 15:38:50 GMT
##[error]Process completed with exit code 1.

dingxiangfei2009 · 2025-08-25T17:21:57Z

Thank you so much for reviewing, @cjgillot !

localized in StateTransform

There was a back story here. While I have updated the PR description to explain, I am also reproducing the text here as well.

This idea was already explored in #120168 back in 2023. The conclusion then was that it does not interact well with MIR dataflow analysis. It requires the StateTransform pass to assign a virtual "MIR local" to each upvar at the beginning. This made the change difficult to review, because it overloads the already huge StateTransform pass with additional renumbering work. The idea has always been that it is better to perform the renumbering in its own pass, to keep StateTransform simple.

a yield terminator at the end of bb0

I am afraid this would introduce a breaking change, through which coroutines and futures need to be polled one extra time in order to start driving. This is because, by design, constructing a coroutine does not transfer control into the coroutine body.
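
The current behaviour this relies on can be checked with a small standard-library-only sketch. `NoopWake` and `body_entered_before_and_after_poll` are hypothetical names used only for this illustration: constructing the future does not run its body, and a single poll is enough to enter it.

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Wake, Waker};

/// A waker that does nothing, just enough to build a `Context`.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Returns whether the async body had been entered
/// (before the first poll, after the first poll).
fn body_entered_before_and_after_poll() -> (bool, bool) {
    let entered = AtomicBool::new(false);
    // Constructing the future only captures `entered`; the body has not run yet.
    let fut = async {
        entered.store(true, Ordering::SeqCst);
    };
    let before = entered.load(Ordering::SeqCst);

    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    // The first poll enters the body from the unresumed state.
    let _ = fut.as_mut().poll(&mut cx);

    (before, entered.load(Ordering::SeqCst))
}
```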

make the transform use the existing infra

However I have been thinking about this. StateTransform operates exclusively on MIR locals in the body. To maximally utilise the StateTransform infra, it is best if we make upvars representable as MIR locals. This patch demonstrates one way to do so, which is through MIR builder emitting statements to move the upvars from the struct fields into true MIR locals.

There is the other way that motivates me to initiate the discussion on the thread #t-lang > Should we lift captures of coroutines into MIR arguments?. This idea is very tempting because of the following reasons.

We observe that coroutine captures still work in the same way as if we move the upvars into user-defined locals. So there are no breaking changes in the language.
Imagine that we express upvars as _1, _2, etc. in the body arguments. Even if they are only formal until the analysis phase, we would need no RelocateUpvars pass and its shenanigans.
We would furthermore improve the coroutine witness type. It currently plays two roles: a symbolic coroutine state with pure struct fields in MIR before StateTransform, and mixed struct/enum data with a struct-like prefix, a discriminant and an enum-like suffix after StateTransform. Having either of the two layouts is not a problem, but having both at the same time is a probable issue for us, because how one interprets this data and which (symbolic) layout applies depends on which phase the coroutine MIR is passing through. Promoting the upvars into proper body arguments can solve this issue because we would no longer need to work with the pre-StateTransform data.

rustbot assigned BoxyUwU Jan 15, 2025

rustbot added S-waiting-on-review T-compiler labels Jan 15, 2025